As the employment environment becomes more challenging, graduate employment has remained a focus of public attention, so exploring how multiple factors affect graduates' salaries, and proposing strategies in related fields, is of far-reaching significance. This report draws on survey data from the Higher Education Statistics Agency (HESA) on the baseline profile of UK graduates in the 2020/21 academic year, with the goal of data visualization, analysis and modelling. The survey's salary indicator covers graduates' annual earnings (before tax) in their main jobs during the census week.
The survey covers graduates from a range of subjects across England, Scotland, Northern Ireland and Wales, from first-degree undergraduates to students studying for a master's degree, and considers a variety of routes to a degree qualification. Institutional data were contributed by higher education providers and further education colleges.
Job classifications in the survey were divided into high, medium and low categories based on constructs such as "skill level" and "skill specialization." These classifications account for factors such as training duration, necessary work experience, and the knowledge required to perform the tasks. Additionally, the survey asked whether graduates view paid work as an activity or as their most important activity.
We are particularly interested in the income differences between graduates with different backgrounds, geographical areas and career choices, aiming for a comprehensive understanding of the initial earnings of graduates with different characteristics. We also explore how graduates' chosen field of study and the degree they pursued affect their subsequent earnings. The interplay between these educational and occupational factors and the types of jobs graduates subsequently choose is of particular interest in explaining the determinants of earnings early in their careers.
(a) Conduct exploratory data analysis to review the data set, employing advanced data visualization techniques to analyze salary distributions between different categories of graduates.
(b) Use the training data set to develop a classification model with the goal of predicting salary levels, using the basic characteristics of graduates as predictive features.
(c) Investigate the importance of factors that affect graduates’ salary levels, aiming to establish an importance ranking of factors based on income levels and quantify the relative impact of various factors on graduates’ salary fluctuations.
#!pip install missingno
!pip install pywaffle
# Web scraping part
import requests
from bs4 import BeautifulSoup
import os
import zipfile
# EDA part
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import missingno as msno
import networkx as nx
import plotly.graph_objects as go
from pywaffle import Waffle
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
plt.rcParams['figure.dpi'] = 140
# Modelling part
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, roc_curve, auc, precision_recall_curve, average_precision_score
from sklearn.preprocessing import LabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from keras import models
from keras import layers
In this section we explain how our dataset was scraped.
url = 'https://www.data.gov.uk/dataset/37b401c3-1689-4f3c-bac4-b6cc39cdefa7/higher-education-graduate-outcomes-data'
page=requests.get(url)
soup=BeautifulSoup(page.content,'lxml')
td = soup.find_all('td')[0]
zip_url = td.find('a')['href'] # Get the url of our desired zip file
folder_name = 'table-30'
response = requests.get(zip_url)
os.makedirs(folder_name, exist_ok=True)
zip_file_path = os.path.join(folder_name, "table-30.zip")
with open(zip_file_path, "wb") as file:
    file.write(response.content)  # Save the content to a local ZIP file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(folder_name)  # Extract the contents of the ZIP file
current_directory = os.getcwd() # Get the current working directory then join together
csv_file_path = os.path.join(current_directory,'table-30', "table-30-2020-21.csv")
# We will only explore the updated dataset
df = pd.read_csv(csv_file_path, header = 14)
df.head()
| | Subject area of degree | Country of provider | Provider type | Level of qualification obtained | Mode of former study | Skill group | Work population marker | Salary band | Academic year | Number | Percent |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01 Medicine and dentistry | All | All | All | All | All | Paid employment is an activity | Less than £15,000 | 2020/21 | 0 | 0% |
| 1 | 01 Medicine and dentistry | All | All | All postgraduate | All | All | Paid employment is an activity | Less than £15,000 | 2020/21 | 0 | 0% |
| 2 | 01 Medicine and dentistry | All | All | All undergraduate | All | All | Paid employment is an activity | Less than £15,000 | 2020/21 | 0 | 0% |
| 3 | 01 Medicine and dentistry | All | All | First degree | All | All | Paid employment is an activity | Less than £15,000 | 2020/21 | 0 | 0% |
| 4 | 01 Medicine and dentistry | All | All | Other undergraduate | All | All | Paid employment is an activity | Less than £15,000 | 2020/21 | 0 | 0% |
# Drop null values
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 371700 entries, 0 to 646348
Data columns (total 11 columns):
 #   Column                           Non-Null Count   Dtype
---  ------                           --------------   -----
 0   Subject area of degree           371700 non-null  object
 1   Country of provider              371700 non-null  object
 2   Provider type                    371700 non-null  object
 3   Level of qualification obtained  371700 non-null  object
 4   Mode of former study             371700 non-null  object
 5   Skill group                      371700 non-null  object
 6   Work population marker           371700 non-null  object
 7   Salary band                      371700 non-null  object
 8   Academic year                    371700 non-null  object
 9   Number                           371700 non-null  int64
 10  Percent                          371700 non-null  object
dtypes: int64(1), object(10)
memory usage: 34.0+ MB
Initially, there was repetitive counting in the 'Number' column, mainly attributed to the 'Total' and 'All' classes. Subdatasets were subsequently filtered based on specific conditions.
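Why the 'Total' and 'All' rows inflate the counts can be shown on a toy frame (hypothetical values): an aggregate row repeats every graduate already counted in the specific rows, so exactly one aggregation level must be kept.

```python
import pandas as pd

# Toy frame mimicking the HESA layout: the 'All' row aggregates the specific
# rows, so a naive sum over 'Number' counts each graduate twice.
toy = pd.DataFrame({
    'Skill group': ['All', 'High skilled', 'Medium skilled', 'Low skilled'],
    'Number': [100, 60, 30, 10],
})

naive_total = toy['Number'].sum()  # 200: double counted
true_total = toy.loc[toy['Skill group'] != 'All', 'Number'].sum()  # 100
```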
# Only Keep total
c0 = df['Subject area of degree'] == 'Total'
c1 = df['Country of provider'] == 'All'
c2 = df['Provider type'] == 'All'
c3 = df['Level of qualification obtained'] == 'All'
c4 = df['Mode of former study'] == 'All'
c5 = df['Skill group'] == 'All'
c6 = df['Work population marker'] == 'Paid employment is an activity'
c7 = df['Salary band'] == 'Total'
c0n = ~df['Subject area of degree'].isin(['Total','Total non-science CAH level 1','Total science CAH level 1'])
c1n = df['Country of provider'] != 'All'
c2n = df['Provider type'] != 'All'
c3n = df['Level of qualification obtained'] != 'All'
c4n = df['Mode of former study'] != 'All'
c5n = df['Skill group'] != 'All'
c7n = df['Salary band'] != 'Total'
df_subject = df[c0n & c1 & c2 & c3 & c4 & c5 & c6 & c7] # Only keep subjects, others are total
subject = df_subject['Subject area of degree'].unique().tolist()
y = df_subject['Number']
fig = go.Figure(go.Treemap(
labels = subject,
parents = ['Subject'] * len(y),
values = y
))
fig.update_layout(title = 'Number of Survey Participants in each subject')
fig.show()
This plot illustrates the variation in the number of survey participants across the 22 subjects. Individual counts can be inspected by clicking a block.
It is evident that significant differences exist between subjects. Consequently, the subsequent salary comparisons use percentages instead of raw counts.
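The conversion from raw counts to within-group percentages can be sketched on a toy frame (hypothetical values; the HESA file already ships a 'Percent' column, which the later sections reuse):

```python
import pandas as pd

# Hypothetical counts for two subjects of very different sizes; normalising
# within each subject makes their salary distributions directly comparable.
counts = pd.DataFrame({
    'Subject': ['Medicine', 'Medicine', 'Design', 'Design'],
    'Salary band': ['Low', 'High', 'Low', 'High'],
    'Number': [2000, 8000, 450, 50],
})
# Within-subject share of each salary band
counts['Percent'] = counts['Number'] / counts.groupby('Subject')['Number'].transform('sum')
```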
df_salary = df[c0 & c1 & c2 & c3 & c4 & c5 & c6 & c7n]
salary = df_salary['Salary band'].unique().tolist()
count_salary = pd.pivot_table(df_salary, values='Number',index='Salary band', aggfunc='sum')
count_salary_reset = count_salary.reset_index()
color_map = ['#20639B' for _ in range(14)]
color_map[4] = '#ED553B' # color highlight
# initialize the figure
plt.figure(figsize=(8,8),dpi=200)
ax = plt.subplot(111, polar=True)
plt.axis('off')
# Constants: parameters controlling the plot layout
upperLimit = 20
lowerLimit = 1
labelPadding = 20
# Compute the maximum count in the dataset
max_number = count_salary_reset['Number'].max()  # avoid shadowing the built-in max()
# Linearly rescale the counts into bar heights:
# a count of 0 maps to the lowerLimit; larger counts grow proportionally
slope = (max_number - lowerLimit) / max_number
heights = slope * count_salary_reset.Number + lowerLimit
# Compute the width of each bar. In total we have 2*Pi = 360°
width = 2*np.pi / len(count_salary_reset.index)
# Compute the angle each bar is centered on:
indexes = list(range(1, len(count_salary_reset.index)+1))
angles = [element * width for element in indexes]
angles
# Draw bars
bars = ax.bar(
x=angles,
height=heights,
width=width,
bottom=lowerLimit,
linewidth=2,
edgecolor="white",
color=color_map,alpha=0.8
)
# Add labels
for bar, angle, height, label in zip(bars,angles, heights, count_salary_reset["Salary band"]):
# Labels are rotated. Rotation must be specified in degrees :(
rotation = np.rad2deg(angle)
# Flip some labels upside down
alignment = ""
if angle >= np.pi/2 and angle < 3*np.pi/2:
alignment = "right"
rotation = rotation + 180
else:
alignment = "left"
# Finally add the labels
ax.text(
x=angle,
y=lowerLimit + bar.get_height() + labelPadding,
s=label,
ha=alignment, fontsize=6,fontfamily='serif',
va='center',
rotation=rotation,
rotation_mode="anchor")
The circular bar plot depicts the distribution of participants across 14 salary bands, spanning from 'Less than £15,000' to '£51,000+', with each bar representing a distinct population size. The visualization shows that the majority of students earned between £24,000 and £27,000 annually upon graduation.
The qualification levels include both undergraduate and postgraduate categories. The undergraduate category has three classes: First degree, Other undergraduate, and Undergraduate unknown. The postgraduate category includes Postgraduate (research) and Postgraduate (taught). The qualification distribution plot focuses on these subclasses.
df_qualification = df[c0 & c1 & c2 & c3n & c4 & c5 & c6 & c7]
qualification = df_qualification['Level of qualification obtained'].unique().tolist()
count_qualification = df_qualification['Number']
count_qualification.index = qualification
fig = plt.figure(
FigureClass=Waffle,
rows=10,
columns=24,
values=count_qualification[2:9],
colors = ('#20639B', '#ED553B', '#3CAEA3', '#F5D55C', '#845EC2'),
title={'label': 'Qualification Distribution', 'loc': 'left','fontsize': 20},
labels=["{}({})".format(a, b) for a, b in zip(count_qualification.index[2:9], count_qualification[2:9]) ],
legend={'loc': 'lower left', 'bbox_to_anchor': (0, -0.3), 'ncol': 3, 'framealpha': 0, 'fontsize': 15},
font_size=30,
icons = 'child',
figsize=(50, 10),
icon_legend=True
)
The predominant group of participants possesses an undergraduate qualification at the first-degree level. Additionally, there are only 10 participants with an unknown qualification at the postgraduate level.
The remaining categorical variables are 'Country of provider', 'Provider type', 'Mode of former study' and 'Skill group'.
df_sub = df[c0 & c1n & c2n & c3 & c4n & c5n & c6 & c7]
pd.pivot_table(df_sub, values='Number',index=['Country of provider','Provider type','Mode of former study','Skill group'], aggfunc='sum')
| Country of provider | Provider type | Mode of former study | Skill group | Number |
|---|---|---|---|---|
| England | Further education colleges (FECs) | Full-time | High skilled | 1575 |
| | | | Low skilled | 440 |
| | | | Medium skilled | 755 |
| | | Part-time | High skilled | 1625 |
| | | | Low skilled | 80 |
| | | | Medium skilled | 450 |
| | Higher education providers (HEPs) | Full-time | High skilled | 94160 |
| | | | Low skilled | 5680 |
| | | | Medium skilled | 12895 |
| | | Part-time | High skilled | 21860 |
| | | | Low skilled | 420 |
| | | | Medium skilled | 1605 |
| Northern Ireland | Further education colleges (FECs) | Full-time | High skilled | 110 |
| | | | Low skilled | 25 |
| | | | Medium skilled | 55 |
| | | Part-time | High skilled | 350 |
| | | | Low skilled | 25 |
| | | | Medium skilled | 150 |
| | Higher education providers (HEPs) | Full-time | High skilled | 2920 |
| | | | Low skilled | 110 |
| | | | Medium skilled | 290 |
| | | Part-time | High skilled | 1105 |
| | | | Low skilled | 25 |
| | | | Medium skilled | 100 |
| Scotland | Higher education providers (HEPs) | Full-time | High skilled | 10615 |
| | | | Low skilled | 620 |
| | | | Medium skilled | 1345 |
| | | Part-time | High skilled | 2270 |
| | | | Low skilled | 55 |
| | | | Medium skilled | 255 |
| Wales | Further education colleges (FECs) | Part-time | High skilled | 35 |
| | Higher education providers (HEPs) | Full-time | High skilled | 5550 |
| | | | Low skilled | 455 |
| | | | Medium skilled | 870 |
| | | Part-time | High skilled | 1350 |
| | | | Low skilled | 40 |
| | | | Medium skilled | 155 |
From the table, it is evident that the largest group comprises students who received full-time higher education in England and hold high-skilled jobs.
df_work_salary = df[c0 & c1 & c2 & c3 & c4 & c5 & c7n]
df_work_salary['Salary band'].value_counts().index
count_work_salary = pd.pivot_table(df_work_salary, values='Number',index='Salary band',columns='Work population marker', aggfunc='sum')
count_work_salary
| Salary band | Paid employment is an activity | Paid employment is most important activity |
|---|---|---|
| Less than £15,000 | 1190 | 1100 |
| £15,000 - £17,999 | 4655 | 4335 |
| £18,000 - £20,999 | 16615 | 15860 |
| £21,000 - £23,999 | 22135 | 21315 |
| £24,000 - £26,999 | 36210 | 35230 |
| £27,000 - £29,999 | 23460 | 22955 |
| £30,000 - £32,999 | 20105 | 19545 |
| £33,000 - £35,999 | 12355 | 12060 |
| £36,000 - £38,999 | 6565 | 6395 |
| £39,000 - £41,999 | 6550 | 6340 |
| £42,000 - £44,999 | 3600 | 3515 |
| £45,000 - £47,999 | 4085 | 3975 |
| £48,000 - £50,999 | 3510 | 3405 |
| £51,000+ | 9425 | 9195 |
The pivot table of 'Salary band' against 'Work population marker' shows no significant difference between the classes 'Paid employment is an activity' and 'Paid employment is most important activity'. Therefore, for the comparison and modelling sections, we focus exclusively on 'Paid employment is an activity', as it encompasses the other class.
df_subject_salary = df[c0n & c1 & c2 & c3 & c4 & c5 & c6 & c7n]
new_salary = ['Less than £21k','£21k - £30k','£30k - £39k','£39k - £48k','£48k+']
# Combine some salary bands
new_cate = {salary[0]: new_salary[0],salary[1]: new_salary[0], salary[2]: new_salary[0],
salary[3]: new_salary[1], salary[4]: new_salary[1], salary[5]: new_salary[1],
salary[6]: new_salary[2], salary[7]: new_salary[2], salary[8]:new_salary[2],
salary[9]: new_salary[3], salary[10]: new_salary[3], salary[11]: new_salary[3],
salary[12]: new_salary[4], salary[13]: new_salary[4]}
df_subject_salary['New salary band'] = df_subject_salary['Salary band'].map(new_cate)
df_subject_salary['Percent'] = df_subject_salary['Percent'].str.strip('%').astype(float)/100
y_data = list(df_subject_salary['Subject area of degree'].unique())
x_data = np.zeros((len(y_data),len(new_salary)))
for i,s in enumerate(y_data):
a = df_subject_salary[df_subject_salary['Subject area of degree'] == s].groupby('New salary band')['Percent'].sum().reset_index()
x_data[i, :] = a['Percent'].to_numpy()
colors = [
"rgba(80, 150, 250, 1.0)",
"rgba(105, 176, 250, 1.0)",
"rgba(237, 213, 92, 0.8)",
"rgba(255, 165, 0, 0.8)",
"rgba(255, 140, 0, 0.8)",
]
fig = go.Figure()
for i in range(0, len(x_data[0])):
for xd, yd in zip(x_data, y_data):
fig.add_trace(go.Bar(
x=[xd[i]], y=[yd],
orientation='h',
marker=dict(
color=colors[i],
line=dict(color='rgb(248, 248, 249)', width=1)
)
))
fig.update_layout(
xaxis=dict(
showgrid=False,
showline=False,
showticklabels=False,
zeroline=False,
domain=[0.15, 1]
),
yaxis=dict(
showgrid=False,
showline=False,
showticklabels=False,
zeroline=False,
),
barmode='stack',
paper_bgcolor='rgb(248, 248, 255)',
plot_bgcolor='rgb(248, 248, 255)',
margin=dict(l=120, r=10, t=140, b=80),
showlegend=False,
)
annotations = []
for yd, xd in zip(y_data, x_data):
# labeling the y-axis
annotations.append(dict(xref='paper', yref='y',
x=0.14, y=yd,
xanchor='right',
text=str(yd),
font=dict(size=6,
color='rgb(67, 67, 67)'),
showarrow=False, align='right'))
# labeling the first percentage of each bar (x_axis)
annotations.append(dict(xref='x', yref='y',
x=xd[0] / 2, y=yd,
text=str(round(xd[0], 2)),
font=dict(size=7,
color='rgb(248, 248, 255)'),
showarrow=False))
# labeling the first Likert scale (on the top)
if yd == y_data[-1]:
annotations.append(dict(xref='x', yref='paper',
x=xd[0] / 2, y=1.1,
text=new_salary[0],
font=dict(size=5,
color='rgb(67, 67, 67)'),
showarrow=False))
space = xd[0]
for i in range(1, len(xd)):
# labeling the rest of percentages for each bar (x_axis)
annotations.append(dict(xref='x', yref='y',
x=space + (xd[i]/2), y=yd,
text=str(round(xd[i], 2)),
font=dict(size=7,
color='rgb(248, 248, 255)'),
showarrow=False))
# labeling the Likert scale
if yd == y_data[-1]:
annotations.append(dict(xref='x', yref='paper',
x=space + (xd[i]/2), y=1.1,
text=new_salary[i],
font=dict(size=5,
color='rgb(67, 67, 67)'),
showarrow=False))
space += xd[i]
fig.update_layout(
title="Salary band by subject",
annotations=annotations)
fig.show()
It is evident that students in the field of '01 Medicine and dentistry' tend to have the highest salaries, whereas those in '25 Design, and creative and performing arts' have comparatively lower salaries.
Considering both this plot and the popularity of the subjects, we include these 10 subjects in the modelling part:
01 Medicine and dentistry
07 Physical sciences
09 Mathematical sciences
10 Engineering and technology
11 Computing
15 Social sciences
16 Law
17 Business and management
22 Education and teaching
25 Design, and creative and performing arts.
df_country_salary = df[c0 & c1n & c2 & c3 & c4 & c5 & c6 & c7n]
df_country_salary = df_country_salary[['Country of provider', 'Salary band','Percent']].copy()
df_country_salary['Percent'] = df_country_salary['Percent'].str.strip('%').astype(float)/100
percent_country_salary = pd.pivot_table(df_country_salary, values='Percent', index='Salary band', columns='Country of provider', aggfunc='sum')
label = ['<£15k',
'£15k - £18k',
'£18k - £21k',
'£21k - £24k',
'£24k - £27k',
'£27k - £30k',
'£30k - £33k',
'£33k - £36k',
'£36k - £39k',
'£39k - £42k',
'£42k - £45k',
'£45k - £48k',
'£48k - £51k',
'£51k+']
fig, ax = plt.subplots(1, 1, figsize=(20, 5))
x_axis = np.arange(len(label))
color = ['#20639B', '#ED553B', '#3CAEA3', '#F5D55C']
plt.bar(x_axis - 0.3, percent_country_salary['England'], 0.2, label = 'England',color=color[0])
plt.bar(x_axis - 0.1, percent_country_salary['Northern Ireland'], 0.2, label = 'Northern Ireland',color=color[1])
plt.bar(x_axis + 0.1, percent_country_salary['Scotland'], 0.2, label = 'Scotland',color=color[2])
plt.bar(x_axis + 0.3, percent_country_salary['Wales'], 0.2, label = 'Wales',color=color[3])
plt.xticks(x_axis, label,rotation=45,ha='right', fontsize=10)
plt.xlabel("Salary Band", fontsize=15)
plt.ylabel("Percent", fontsize=15)
plt.title("Salary band by country", fontsize=20)
plt.legend()
plt.show()
While the percentages differ only slightly among the four countries, the participant counts for the other three are far lower than for England. It is therefore reasonable to combine the countries and drop the 'Country of provider' variable in the modelling part.
df_prov_salary = df[c0 & c1 & c2n & c3 & c4 & c5 & c6 & c7n]
df_prov_salary = df_prov_salary[['Provider type', 'Salary band', 'Number']].copy()
count_prov_salary = pd.pivot_table(df_prov_salary, values='Number', index='Salary band', columns='Provider type', aggfunc='sum')
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
color = ['#20639B', '#ED553B']
FEC = count_prov_salary['Further education colleges (FECs)'].reset_index()
HEP = count_prov_salary['Higher education providers (HEPs)'].reset_index()
ax.plot(HEP.index, HEP['Higher education providers (HEPs)'], color=color[0], label='HEP')
ax.fill_between(HEP.index, 0, HEP['Higher education providers (HEPs)'], color=color[0], alpha=0.9)
ax.plot(FEC.index, FEC['Further education colleges (FECs)'], color=color[1], label='FEC')
ax.fill_between(FEC.index, 0, FEC['Further education colleges (FECs)'], color=color[1], alpha=0.9)
ax.yaxis.tick_right()
ax.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .7)
for s in ['top', 'right','bottom','left']:
ax.spines[s].set_visible(False)
ax.grid(False)
x_axis = np.arange(len(label))
plt.xticks(x_axis, label,rotation=45,ha='right', fontsize=10)
fig.text(0.13, 0.85, 'Salary band by provider', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.13,0.5,"FEC", fontweight="bold", fontfamily='serif', fontsize=15, color='#ED553B')
fig.text(0.17,0.5,"|", fontweight="bold", fontfamily='serif', fontsize=15, color='black')
fig.text(0.18,0.5,"HEP", fontweight="bold", fontfamily='serif', fontsize=15, color='#20639B')
ax.tick_params(axis=u'both', which=u'both',length=0)
plt.show()
Due to limited data provided by FEC, our modeling efforts will focus exclusively on Higher Education Providers (HEP).
df_mode_salary = df[c0 & c1 & c2 & c3 & c4n & c5 & c6 & c7n]
df_mode_salary['Percent'] = df_mode_salary['Percent'].str.strip('%').astype(float)/100
df_mode_salary = df_mode_salary[['Mode of former study', 'Salary band', 'Percent']].copy()
pivot_mode_salary = pd.pivot_table(df_mode_salary, values='Percent', index='Salary band', columns='Mode of former study', aggfunc='sum')
pivot_mode_salary1 = pivot_mode_salary.reset_index()
full = pivot_mode_salary1['Full-time']*100
part = - pivot_mode_salary1['Part-time']*100
fig, ax = plt.subplots(1,1, figsize=(12, 6))
ax.bar(label, full, width=0.5, color='#20639B', alpha=0.8, label='Full time')
ax.bar(label, part, width=0.5, color='#ED553B', alpha=0.8, label='Part time')
for i in range(14):
ax.annotate(f"{-int(part[i])}%",xy=(i, part[i]-1),va = 'center', ha='center',fontweight='light', fontfamily='serif',color='#ED553B')
for i in range(14):
ax.annotate(f"{int(full[i])}%",
xy=(i, full[i]+1),
va = 'center', ha='center',fontweight='light', fontfamily='serif',
color='#20639B')
for s in ['top', 'left', 'right', 'bottom']:
ax.spines[s].set_visible(False)
ax.set_xticklabels(label,rotation=45, fontfamily='serif')
ax.set_yticks([])
ax.legend().set_visible(False)
fig.text(0.16, 1, 'Salary band by study mode', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.825,0.924,"Part", fontweight="bold", fontfamily='serif', fontsize=15, color='#ED553B')
fig.text(0.815,0.924,"|", fontweight="bold", fontfamily='serif', fontsize=15, color='black')
fig.text(0.775,0.924,"Full", fontweight="bold", fontfamily='serif', fontsize=15, color='#20639B')
plt.show()
The salary distribution differs significantly between full-time and part-time students, with part-time students generally earning higher salaries. This observation is reasonable considering their typically longer work experience compared to full-time students.
df_qual_salary = df[c0 & c1 & c2 & c3n & c4 & c5 & c6 & c7n]
df_qual_salary['Percent'] = df_qual_salary['Percent'].str.strip('%').astype(float)/100
df_qual_salary = df_qual_salary[['Level of qualification obtained', 'Salary band', 'Percent']].copy()
df_qual_salary = df_qual_salary[df_qual_salary['Level of qualification obtained'] != 'Postgraduate unknown']
pivot_qual_salary = pd.pivot_table(df_qual_salary, values='Percent', index='Level of qualification obtained', columns='Salary band', aggfunc='sum')
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
qual = [ 'All undergraduate', 'First degree',
'Other undergraduate','Undergraduate unknown', 'All postgraduate','Postgraduate (research)',
'Postgraduate (taught)']
below_30_colors = ['#00274d', '#004080', '#005cbf', '#007bff', '#6fa2f5', '#a4c7f6', '#d2e4f9', '#f0f7fc']
above_30_colors = ['#f5f5f1', '#d1d1d1', '#a2a2a2', '#737373', '#525252', '#404040', '#2e2e2e', '#1f1f1f']
# Create a colormap
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("custom_cmap",below_30_colors + above_30_colors)
sns.heatmap(pivot_qual_salary.loc[qual,salary],cmap=cmap,square=True, linewidth=2.5,cbar=False,
annot=True,fmt='1.0%',vmax=.6,vmin=0.05,ax=ax,annot_kws={"fontsize":12})
ax.spines['top'].set_visible(True)
fig.text(.99, .725, 'Salary proportion of qualification obtained', fontweight='bold', fontfamily='serif', fontsize=15,ha='right')
ax.set_yticklabels(ax.get_yticklabels(), fontfamily='serif', rotation = 0, fontsize=11)
ax.set_xticklabels(ax.get_xticklabels(), fontfamily='serif', rotation=90, fontsize=11)
ax.set_ylabel('')
ax.set_xlabel('')
ax.tick_params(axis=u'both', which=u'both',length=0)
plt.tight_layout()
plt.show()
The lighter shade of blue represents a higher percentage, and it is noteworthy that postgraduate students tend to have higher salaries, with research postgraduate students earning the highest salaries.
df_skill_salary = df[c0 & c1 & c2 & c3 & c4 & c5n & c6 & c7n]
df_skill_salary['Percent'] = df_skill_salary['Percent'].str.strip('%').astype(float)/100
df_skill_salary = df_skill_salary[['Skill group', 'Salary band', 'Percent']].copy()
percent_skill_salary = pd.pivot_table(df_skill_salary, values='Percent', index='Salary band', columns='Skill group', aggfunc='sum')
fig, ax = plt.subplots(1, 1, figsize=(20, 5))
plt.bar(np.arange(len(percent_skill_salary.index))-0.2, height=percent_skill_salary["High skilled"], zorder=3, color='#20639B', width=0.05)
plt.scatter(np.arange(len(percent_skill_salary.index))-0.2, percent_skill_salary["High skilled"], zorder=3,s=20, color='#20639B')
plt.bar(np.arange(len(percent_skill_salary.index)), height=percent_skill_salary["Medium skilled"], zorder=3, color='#ED553B', width=0.05)
plt.scatter(np.arange(len(percent_skill_salary.index)), percent_skill_salary["Medium skilled"], zorder=3,s=20, color='#ED553B')
plt.bar(np.arange(len(percent_skill_salary.index))+0.2, height=percent_skill_salary["Low skilled"], zorder=3, color='#3CAEA3', width=0.05)
plt.scatter(np.arange(len(percent_skill_salary.index))+0.2, percent_skill_salary["Low skilled"], zorder=3,s=20, color='#3CAEA3')
x_axis = np.arange(len(label))
plt.xticks(x_axis, label,rotation=45,ha='right', fontsize=10)
plt.xlabel("Salary Band", fontsize=15, fontfamily='serif')
plt.ylabel("Percent", fontsize=15, fontfamily='serif')
for s in ['top', 'left', 'right', 'bottom']:
ax.spines[s].set_visible(False)
fig.text(0.16, 1, 'Salary band by skill groups', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.725,0.8,"High skilled", fontweight="bold", fontfamily='serif', fontsize=15, color='#20639B')
fig.text(0.725,0.7,"Medium skilled", fontweight="bold", fontfamily='serif', fontsize=15, color='#ED553B')
fig.text(0.725,0.6,"Low skilled", fontweight="bold", fontfamily='serif', fontsize=15, color='#3CAEA3')
plt.show()
It is intuitive to expect that students with high skills would likely have higher salaries than their counterparts.
| Variable | Number of Categories | Comparing by Salary Bands | Modelling Part |
|---|---|---|---|
| Subject area of degree | 22 | Highest: 01 Medicine; Lowest: 25 Design | Choose 10 subjects |
| Country of provider | 4 | Similar percentages in salary bands | Combine them together **[Delete]** |
| Provider type | 2 | Not enough data in FECs | Choose HEPs |
| Level of qualification obtained | 5 | Highest: **Postgraduate (Research)** | Only 4 categories left after choosing HEPs |
| Mode of former study | 2 | **Part-time** tends to have higher salary than full-time | Both of two |
| Skill group | 3 | Highest: **High skilled** | All of three |
| Work population marker | 2 | One contains another | Choose 'Paid employment is an activity' |
| Salary band | 14 | **Need further operation with 'Number'** | |
| Academic year | 1 | '2020/21' | **[Delete]** |
selected_data = df[df['Subject area of degree'].isin(['01 Medicine and dentistry','09 Mathematical sciences','07 Physical sciences','11 Computing','10 Engineering and technology','16 Law','17 Business and management','15 Social sciences','25 Design, and creative and performing arts','22 Education and teaching'])
& (df['Country of provider'] == 'All')
& (df['Provider type'] == 'Higher education providers (HEPs)')
& ((df['Level of qualification obtained'] != 'All undergraduate') & (df['Level of qualification obtained'] != 'All postgraduate') & (df['Level of qualification obtained'] != 'All'))
& (df['Mode of former study'] != 'All')
& (df['Skill group'] != 'All')
& (df['Work population marker'] == 'Paid employment is an activity')
& (df['Salary band'] != 'Total')]
columns_to_remove = ['Academic year','Country of provider','Provider type','Percent','Work population marker']
selected_data = selected_data.drop(columns=columns_to_remove)
Each row of selected_data is then replicated according to the value in its 'Number' column: a row with Number equal to n is expanded into n identical records, one per surveyed graduate.
def generate_samples(row):
    count = row['Number']
    features = [row['Subject area of degree'], row['Level of qualification obtained'],
                row['Mode of former study'], row['Skill group'], row['Salary band']]
    samples = np.repeat([features], count, axis=0)
    return pd.DataFrame(samples, columns=['Subject area of degree', 'Level of qualification obtained',
                                          'Mode of former study', 'Skill group', 'Salary band'])

class_df = pd.concat(selected_data.apply(generate_samples, axis=1).tolist(), ignore_index=True)
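The apply-and-concat expansion works but can be slow on large frames. A vectorized equivalent (a sketch with toy values, not the survey data) uses `Index.repeat` to replicate each row by its count:

```python
import pandas as pd

# Toy frame standing in for selected_data (illustrative values only)
df = pd.DataFrame({
    "Subject area of degree": ["16 Law", "11 Computing"],
    "Number": [3, 2],
})

# Repeat each row label 'Number' times, select those rows, drop the count column
expanded = (
    df.loc[df.index.repeat(df["Number"])]
      .drop(columns="Number")
      .reset_index(drop=True)
)
print(len(expanded))  # 5 rows: 3 Law + 2 Computing
```

On a frame of this shape the two approaches produce the same expanded records; the vectorized form simply avoids calling a Python function once per row.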
Drawing on the findings from the census research on graduate salaries in the United Kingdom for 2020-2021, our analysis integrates contextual factors such as the cost of living and taxation. On that basis we establish a classification scheme for salary levels aligned with the survey's band boundaries: salaries below £24,000 are categorized as low, salaries from £24,000 to £35,999 as medium, and salaries of £36,000 and above as high. Applying this scheme to the dataset gives a more interpretable view of the distribution of graduate salaries.
class_df['Salary level'] = ''  # placeholder column, filled in below
for i in class_df.index:
    salary = class_df.loc[i, "Salary band"]
    if salary in ['Less than £15,000', '£15,000 - £17,999', '£18,000 - £20,999', '£21,000 - £23,999']:
        class_df.loc[i, 'Salary level'] = 'Low'
    elif salary in ['£24,000 - £26,999', '£27,000 - £29,999', '£30,000 - £32,999', '£33,000 - £35,999']:
        class_df.loc[i, 'Salary level'] = 'Med'
    else:
        class_df.loc[i, 'Salary level'] = 'High'
class_df.head()
| | Subject area of degree | Level of qualification obtained | Mode of former study | Skill group | Salary band | Salary level |
|---|---|---|---|---|---|---|
| 0 | 01 Medicine and dentistry | First degree | Full-time | High skilled | £18,000 - £20,999 | Low |
| 1 | 01 Medicine and dentistry | First degree | Full-time | High skilled | £18,000 - £20,999 | Low |
| 2 | 01 Medicine and dentistry | First degree | Full-time | High skilled | £18,000 - £20,999 | Low |
| 3 | 01 Medicine and dentistry | First degree | Full-time | High skilled | £18,000 - £20,999 | Low |
| 4 | 01 Medicine and dentistry | First degree | Full-time | High skilled | £18,000 - £20,999 | Low |
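The band-to-level assignment above can also be written without a row-by-row loop, using a vectorized `Series.map`; a sketch on a few example bands (same band lists as above):

```python
import pandas as pd

bands_low = ['Less than £15,000', '£15,000 - £17,999', '£18,000 - £20,999', '£21,000 - £23,999']
bands_med = ['£24,000 - £26,999', '£27,000 - £29,999', '£30,000 - £32,999', '£33,000 - £35,999']

def band_to_level(band):
    # Anything outside the low/med lists falls into the High bucket
    if band in bands_low:
        return 'Low'
    if band in bands_med:
        return 'Med'
    return 'High'

s = pd.Series(['£18,000 - £20,999', '£27,000 - £29,999', '£39,000 - £41,999'])
print(s.map(band_to_level).tolist())  # ['Low', 'Med', 'High']
```

`class_df['Salary level'] = class_df['Salary band'].map(band_to_level)` would then fill the whole column in one pass.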
We employ 'Subject area of degree', 'Level of qualification obtained', 'Mode of former study', and 'Skill group' as features, with 'Salary level' serving as the class label for each sample.
We opted for a comparative analysis involving a selection of classifiers, including Logistic regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVM) with Radial Basis Function (RBF) kernels, Random Forest, AdaBoost, and Naive Bayes.
# Identify classifiers
classifiers = {
"Logistic Regression": LogisticRegression(max_iter=10000),
"Nearest Neighbour": KNeighborsClassifier(),
"RBF SVM": svm.SVC(kernel = 'rbf', gamma = 0.5, C = 0.1, probability = True),
"Random Forest": RandomForestClassifier(n_estimators = 12, criterion = 'entropy'),
"AdaBoost": AdaBoostClassifier(),
"Naive Bayes": BernoulliNB()
}
print(class_df['Salary level'].value_counts())
Med 55065 Low 24850 High 22755 Name: Salary level, dtype: int64
Samples in the Med salary level are more than twice as numerous as those in either the High or Low group. Therefore, after the train-test split, a resampling method is employed to reduce the number of instances in the majority class (Med) within the training set.
# Splitting train-test data
X = class_df.iloc[:, range(4)]
X = pd.get_dummies(X, columns = ['Subject area of degree', 'Level of qualification obtained',
'Mode of former study', 'Skill group'], drop_first = True)
y = class_df.iloc[:, -1]
n_samples, n_features = X.shape
n_classes = len(np.unique(y))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
undersampler = RandomUnderSampler(sampling_strategy = 'majority')
X_train, y_train = undersampler.fit_resample(X_train, y_train)
print(y_train.value_counts())
Low 19825 High 18231 Med 18231 Name: Salary level, dtype: int64
Following the implementation of the undersampling procedure, the dataset exhibits a notable improvement in balance, with the distribution of instances across classes approaching a more equitable distribution.
# Fit and predict the data
for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)

y_preds = {}
y_scores = {}
for name, classifier in classifiers.items():
    y_preds[name] = classifier.predict(X_test)
    y_scores[name] = classifier.predict_proba(X_test)
# Accuracy
accuracy = {}
for name, y_pred in y_preds.items():
    accuracy[name] = accuracy_score(y_test, y_pred)
pd.DataFrame.from_dict(accuracy, orient="index", columns=["Accuracy"])
| | Accuracy |
|---|---|
| Logistic Regression | 0.636846 |
| Nearest Neighbour | 0.238726 |
| RBF SVM | 0.641229 |
| Random Forest | 0.639914 |
| AdaBoost | 0.636603 |
| Naive Bayes | 0.561069 |
After a series of experiments and systematic parameter tuning across the classifiers, the Random Forest classifier and the RBF SVM achieved the two highest classification accuracies on this dataset, both consistently around 64%. The classifier with the lowest accuracy was KNN.
Nevertheless, relying solely on accuracy as a metric for assessing classification performance in the context of highly imbalanced data should be avoided. Instead, alternative evaluation methods will be employed to thoroughly explore and gauge the model's effectiveness in such situations.
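A small sketch (not part of the report's pipeline) illustrates why accuracy alone can mislead on imbalanced classes: `balanced_accuracy_score`, which averages per-class recall, exposes a majority-class predictor that plain accuracy flatters.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy imbalanced labels: a degenerate classifier that always predicts 'Med'
y_true = ['Med'] * 8 + ['Low'] + ['High']
y_pred = ['Med'] * 10

acc = accuracy_score(y_true, y_pred)           # 0.8 - looks respectable
bal = balanced_accuracy_score(y_true, y_pred)  # 0.333... - reveals the problem
print(acc, bal)
```

Balanced accuracy is the mean of per-class recalls (here 1.0, 0.0, 0.0), so it penalizes models that ignore the minority classes.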
The assessment of the classification model's performance involves the construction and analysis of the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve. Comparative evaluations are conducted across various classifiers to discern variations in performance. The computation of True Positive Rate (TPR) and False Positive Rate (FPR) is imperative in the delineation of these curves.
Given the inherent complexity of the classification task as a multi-class problem, the direct calculation of TPR and FPR is unattainable. In response, the One vs Rest (OvR) multi-class classification strategy is employed to address this challenge. This strategy entails the transformation of the multi-class classification problem into a series of binary classification problems. Each binary classifier is specifically designed to discriminate one class from the rest, allowing for the computation of TPR and FPR in the context of the binary classification. The aggregation of these metrics across all binary classifiers yields the average TPR and FPR, providing a comprehensive evaluation of the classifier's performance.
# OneVsRest classifier
label_binarizer = LabelBinarizer().fit(y_train)
y_onehot_test = label_binarizer.transform(y_test)
classifiers_ovr = {}
for name, classifier in classifiers.items():
    classifiers_ovr[name] = OneVsRestClassifier(classifier)
y_score_ovrs = {}
for name, classifier in classifiers_ovr.items():
    classifier.fit(X_train, y_train)
    y_score_ovrs[name] = classifier.predict_proba(X_test)
# ROC curve
for name, y_score_ovr in y_score_ovrs.items():
    fpr, tpr, _ = roc_curve(y_onehot_test.ravel(), y_score_ovr.ravel())
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='black')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))
plt.show()
In the ROC curve plot, it can be observed that the Random Forest classifier exhibits the highest AUC, consistently hovering around 0.80. Its ROC curve is also situated closest to the upper-left corner, whereas the algorithm with the poorest classification performance is the KNN method.
# PR curve
for name, y_score_ovr in y_score_ovrs.items():
    precision = dict()
    recall = dict()
    average_precision = dict()
    for i in range(len(np.unique(y))):
        precision[i], recall[i], _ = precision_recall_curve(y_onehot_test[:, i], y_score_ovr[:, i])
        average_precision[i] = average_precision_score(y_onehot_test[:, i], y_score_ovr[:, i])
    precision["micro"], recall["micro"], _ = precision_recall_curve(y_onehot_test.ravel(), y_score_ovr.ravel())
    average_precision["micro"] = average_precision_score(y_onehot_test, y_score_ovr, average="micro")
    plt.plot(recall["micro"], precision["micro"], label=f'{name} (AP = {average_precision["micro"]:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Average Precision-Recall Curve')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
In the averaged PR plot, the curves for the Random Forest and SVM classifiers sit closest to the upper-right corner, each achieving an average precision of 64%, while the KNN classifier again shows the weakest performance. For highly imbalanced data, the PR curve evaluates classifiers more faithfully than the ROC curve, and the advantage of Random Forest is particularly pronounced here.
These findings align with the theoretical analysis. The features in this dataset are exclusively categorical, and the random forest algorithm, which classifies by aggregating the votes of many decision trees, is highly compatible with discrete data and notably resilient against overfitting. KNN, by contrast, is highly sensitive to imbalanced data, which explains its poorest performance on this particular dataset.
y_best_pred = y_preds['Random Forest']
print(classification_report(y_test, y_best_pred))  # Classification report
pd.crosstab(y_test, y_best_pred, rownames=['Actual Salary level'], colnames=['Predicted Salary level'])  # Pseudo confusion matrix
precision recall f1-score support
High 0.59 0.62 0.60 4524
Low 0.56 0.61 0.58 5025
Med 0.71 0.66 0.68 10985
accuracy 0.64 20534
macro avg 0.62 0.63 0.62 20534
weighted avg 0.65 0.64 0.64 20534
| Predicted Salary level | High | Low | Med |
|---|---|---|---|
| Actual Salary level | |||
| High | 2808 | 382 | 1334 |
| Low | 330 | 3068 | 1627 |
| Med | 1647 | 2074 | 7264 |
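The per-class figures in the classification report can be recovered directly from the confusion matrix above; a quick cross-check (values copied from the table):

```python
import numpy as np

# Rows: actual (High, Low, Med); columns: predicted (High, Low, Med)
cm = np.array([[2808,  382, 1334],
               [ 330, 3068, 1627],
               [1647, 2074, 7264]])

recall = cm.diagonal() / cm.sum(axis=1)     # correct predictions over each actual class
precision = cm.diagonal() / cm.sum(axis=0)  # correct predictions over each predicted class
print(recall.round(2))     # [0.62 0.61 0.66]
print(precision.round(2))  # [0.59 0.56 0.71]
```

These match the High/Low/Med rows of the classification report, confirming the two outputs describe the same predictions.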
In this section, rather than comparing salaries within individual categories, our objective is to identify which features across the entire dataset most strongly influence the salary range. Since Random Forest proved the best model for the classification problem, we fit a Random Forest regressor and keep the configuration with the smallest Mean Squared Error (MSE). We then compare the importance scores of the features to determine which have the greater impact on the model.
To begin, we create a new dataframe by separating each salary range into a separate column. As in the classification section, we convert the categorical variables into dummy variables with values 0 and 1. We then use a pivot table to group records with identical features, yielding a new dataframe in which each row records the number of individuals in each salary range.
new_df = selected_data.copy()  # Copy so the original selection is not modified
new_df = pd.get_dummies(new_df, columns=['Subject area of degree', 'Level of qualification obtained', 'Mode of former study', 'Skill group'], prefix='', prefix_sep='')
new_df['Salary_band_range'] = new_df['Salary band'].str.split(' - ').str[0]
# Pivot the table by grouping together workers with same features
pivot_df = pd.pivot_table(new_df, values='Number', index=['01 Medicine and dentistry',
'07 Physical sciences', '09 Mathematical sciences',
'10 Engineering and technology', '11 Computing', '15 Social sciences',
'16 Law', '17 Business and management', '22 Education and teaching',
'25 Design, and creative and performing arts', 'First degree',
'Other undergraduate', 'Postgraduate (research)',
'Postgraduate (taught)', 'Full-time', 'Part-time', 'High skilled',
'Low skilled', 'Medium skilled'], columns='Salary_band_range', aggfunc='sum', fill_value=0)
pivot_df.reset_index(inplace=True) # Reset the index
pivot_df.head()
| Salary_band_range | 01 Medicine and dentistry | 07 Physical sciences | 09 Mathematical sciences | 10 Engineering and technology | 11 Computing | 15 Social sciences | 16 Law | 17 Business and management | 22 Education and teaching | 25 Design, and creative and performing arts | ... | £24,000 | £27,000 | £30,000 | £33,000 | £36,000 | £39,000 | £42,000 | £45,000 | £48,000 | £51,000+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 40 | 15 | 35 | 15 | 15 | 15 | 15 | 10 | 10 | 30 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 20 | 10 | 5 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 125 | 70 | 60 | 30 | 10 | 15 | 10 | 15 | 5 | 20 |
5 rows × 33 columns
Then we normalize the counts across salary ranges within each row, so that the columns are on comparable scales when comparing variable importance.
Y = pivot_df.iloc[:,19:33]
X = pivot_df.iloc[:,0:19]
row_sums = Y.sum(axis=1)
row_sums[row_sums == 0] = 1 # Replace zero row sums with 1 to avoid division by zero
normalized_Y = Y.div(row_sums, axis=0)# Divide each element in the DataFrame by the corresponding row sum
We choose the number of estimators for the Random Forest model by comparing MSE values; the model with the smallest MSE is considered the 'best' model.
X_train, X_test, y_train, y_test = train_test_split(X, normalized_Y, test_size=0.2, random_state=21)
n_estimators_range = range(1, 101)  # Set the range of estimators
mse_values = []
for n_estimators in n_estimators_range:
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=21)
    rf_model.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_values.append(mse)
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(n_estimators_range, mse_values, 'ro-')
plt.xlabel('Number of Estimators')
plt.ylabel('Mean Squared Error (MSE)')
plt.title('MSE vs Number of Estimators in Random Forest')
plt.show()
best_n_estimators = n_estimators_range[np.argmin(mse_values)]  # Get the number of estimators of our best model
rf_model = RandomForestRegressor(n_estimators=best_n_estimators, random_state=21)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
importances = pd.DataFrame({'Columns':X_train.columns,'Feature_Importances':rf_model.feature_importances_})
importances = importances.sort_values(by='Feature_Importances',ascending=False)
fig, ax = plt.subplots(figsize=(8,6))
ax = sns.barplot(x=importances['Feature_Importances'], y=importances['Columns'], color = '#FF5733')
sns.despine()
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Random Forest Feature Importance')
plt.show()
The importance score of each column indicates its contribution to the model's predictions, with higher scores indicating a stronger impact. 'High skilled' is the most important feature, highlighting that individuals classified as "High skilled" in the 'Skill group' variable have a significant impact on the predicted salary range. The second most important feature is 'Postgraduate (research)', indicating that a postgraduate research qualification strongly influences salary range. The 'Part-time' category of the 'Mode of former study' variable is also influential, implying that individuals who studied part-time exhibit patterns that help predict their salary range. Notably, these important features are consistent with those highlighted in the earlier EDA section, suggesting the model predicts salary ranges from the given variables reasonably well.
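Impurity-based feature importances can be biased, so a useful cross-check is scikit-learn's permutation importance, which measures how much shuffling a feature degrades predictions. A minimal sketch on synthetic binary features (not the survey data; names and values are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Three synthetic 0/1 features; only the first one drives the target
X = rng.integers(0, 2, size=(300, 3)).astype(float)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.argmax())  # 0 - the informative feature
```

If both importance measures agree on the top features, as they should here, the ranking is more trustworthy.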
In this section, we use a neural network model to address the same multi-class classification problem. The model stacks three dense layers, is trained with the categorical cross-entropy loss function, and tracks accuracy at each epoch.
# Sequential model
nn_model = models.Sequential()
nn_model.add(layers.Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(19,)))
nn_model.add(layers.Dense(10,activation='relu', kernel_initializer='he_normal'))
nn_model.add(layers.Dense(14, activation='softmax'))
nn_model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
history = nn_model.fit(X, normalized_Y, epochs=100, batch_size=32, validation_split=0.3, verbose = 2)
Epoch 1/100
3/3 - 1s - loss: 2.6823 - accuracy: 0.0316 - val_loss: 2.6691 - val_accuracy: 0.0714 - 1s/epoch - 390ms/step
Epoch 2/100
3/3 - 0s - loss: 2.6714 - accuracy: 0.0316 - val_loss: 2.6657 - val_accuracy: 0.0714 - 53ms/epoch - 18ms/step
...
Epoch 99/100
3/3 - 0s - loss: 2.0897 - accuracy: 0.6211 - val_loss: 2.4005 - val_accuracy: 0.3095 - 54ms/epoch - 18ms/step
Epoch 100/100
3/3 - 0s - loss: 2.0881 - accuracy: 0.6211 - val_loss: 2.4004 - val_accuracy: 0.3095 - 53ms/epoch - 18ms/step
# Visualising model's performance by the loss function
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
ax.plot(np.arange(100), history.history['loss'], 'b-', label='loss')
xlab, ylab = ax.set_xlabel('Epoch'), ax.set_ylabel('Loss')
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
ax.plot(np.arange(100), history.history['accuracy'], 'b-', label='accuracy')
xlab, ylab = ax.set_xlabel('Epoch'), ax.set_ylabel('Accuracy')
From these two graphs, we can see that loss decreases and accuracy increases with each epoch. However, peak training accuracy reaches only around 0.6, indicating that the model is not highly accurate in predicting salary ranges. This limitation stems from the features being entirely dummy variables and from the choice of hyperparameters and activation functions. In future work, we plan to tune these parameters to improve the neural network's performance.
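One low-cost improvement when tuning is to stop training once validation loss stops improving; Keras provides this via the `EarlyStopping` callback (`tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=...)`). The underlying patience rule is simple and can be sketched framework-free:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop: after
    `patience` epochs with no improvement, or at the last epoch."""
    best, best_epoch = float('inf'), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# Validation loss plateaus after epoch 2, so training halts early
print(early_stop_epoch([1.0, 0.8, 0.7, 0.7, 0.71, 0.72, 0.73], patience=3))  # 5
```

Applied to the run above, where val_loss flattens near 2.40 well before epoch 100, such a rule would cut most of the wasted epochs.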
We have achieved our objectives and made some interesting findings.
In summary,
(a) The exploratory data analysis (EDA) conducted in this study compared the distribution of graduates across different categories within each variable. Furthermore, an exploration was undertaken by comparing the percentage of graduates in relation to both salary bands and categories within each variable, which indicated postgraduate (research), part-time, and high-skilled students might earn a higher salary within their respective categories.
(b) The random forest classifier emerges as the most apt model for handling this particular dataset, exhibiting an average accuracy of approximately 64% and an AUC value of 0.79. In contrast to alternative classification models, it demonstrates enhanced robustness, effectively addressing the challenges inherent in predicting categories within this dataset.
(c) We conducted a comparison of the impact of various characteristics on salary ranges throughout the dataset. Notably, we identified 'highly skilled,' 'postgraduate (research)', and 'part-time' as the top three important variables. Interestingly, these findings align with the observations made in the exploratory data analysis (EDA) section, indicating a good performance of our model.
In subsequent research, it is plausible to transform each salary category into a continuous variable, facilitating data analysis and modeling wherein the annual income variable is treated as a continuous entity. Simultaneously, we aim to enhance our reports by delving deeper into data visualization tools and identifying more suitable models.
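For that continuous-salary follow-up, one option is to map each band to a rough midpoint. A hedged sketch (the open-ended bands require an assumed value; `top_value` here is purely illustrative, not taken from the survey):

```python
import re

def band_midpoint(band, top_value=55000):
    """Map a salary-band label to a rough continuous value using the
    band midpoint; the open-ended bands need an assumed stand-in."""
    nums = [int(n.replace(',', '')) for n in re.findall(r'[\d,]+', band)]
    if len(nums) == 2:
        return (nums[0] + nums[1]) / 2       # e.g. '£24,000 - £26,999'
    if 'Less than' in band:
        return nums[0] / 2                   # e.g. 'Less than £15,000'
    return top_value                         # open-ended top band, e.g. '£51,000+'

print(band_midpoint('£24,000 - £26,999'))  # 25499.5
print(band_midpoint('Less than £15,000'))  # 7500.0
print(band_midpoint('£51,000+'))           # 55000
```

The resulting numeric column could then feed regression models directly, at the cost of the coarseness introduced by banding.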
SUBIN AN (2019). The Hitchhiker’s Guide to the Kaggle. [online] kaggle.com. Available at: https://www.kaggle.com/code/subinium/the-hitchhiker-s-guide-to-the-kaggle
JOSH (2021). Netflix Data Visualization. [online] kaggle.com. Available at: https://www.kaggle.com/code/joshuaswords/netflix-data-visualization
JOSH (2021). Awesome HR Data Visualization & Prediction. [online] kaggle.com. Available at: https://www.kaggle.com/code/joshuaswords/awesome-hr-data-visualization-prediction
James, Gareth, et al. (2023). An Introduction to Statistical Learning: With Applications in Python. Springer Nature. Available at: https://www.statlearning.com/